u … error term, which contains unobserved factors (for which data are lacking), like moral character, the wage in criminal activity, family background, etc. Oddly enough, it is this error term that attracts the most attention in econometrics
Econometric model of job training and worker productivity
wage = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 training + u
wage … hourly wage
educ … years in formal education
exper … years of workforce experience
training … weeks spent in job training
u … error term representing unobserved determinants of the wage like innate ability, quality of education, family background
As mentioned above, most of econometrics deals with the specification of the error u. As we will see, this is essential for a causal interpretation of the estimates
Econometric models may also be used for hypothesis testing
For example, the parameter \beta_3 represents the effect of training on wages
How large is this effect? Is it even different from zero?
1.1.3 Data
Econometric analysis requires data and there are different kinds of economic data sets
Cross-sectional data
Time series data
Pooled cross sections
Panel/Longitudinal data
Econometric methods depend on the nature of the data used
Different data sets lead to different estimation problems. Use of inappropriate methods may lead to misleading results
Cross-sectional data sets
Sample of individuals, households, firms, cities, states, countries or other units of interest at a given point of time/in a given period
Cross-sectional observations are more or less independent
For example, pure random sampling from a population
Sometimes pure random sampling is violated, e.g., units refuse to respond in surveys, or if sampling is characterized by clustering (this usually leads to autocorrelation, heteroscedasticity or sample selection problems)
Cross-sectional data are typically encountered in applied microeconomics
```r
# Cross-sectional data set on wages and other characteristics.
# Look especially at indicator variables
library(wooldridge)
data(wage1)
head(wage1, 10)

# or
library(gt)  # for pretty html-table plots
gt(head(wage1, 10))
```
    wage educ exper tenure nonwhite female married numdep smsa northcen south west
1   3.10   11     2      0        0      1       0      2    1        0     0    1
2   3.24   12    22      2        0      1       1      3    1        0     0    1
3   3.00   11     2      0        0      0       0      2    0        0     0    1
4   6.00    8    44     28        0      0       1      0    1        0     0    1
5   5.30   12     7      2        0      0       1      1    0        0     0    1
6   8.75   16     9      8        0      0       1      0    1        0     0    1
7  11.25   18    15      7        0      0       0      0    1        0     0    1
8   5.00   12     5      3        0      1       0      0    1        0     0    1
9   3.60   12    26      4        0      1       0      2    1        0     0    1
10 18.18   17    22     21        0      0       1      0    1        0     0    1

   construc ndurman trcommpu trade services profserv profocc clerocc servocc    lwage expersq tenursq
1         0       0        0     0        0        0       0       0       0 1.131402       4       0
2         0       0        0     0        1        0       0       0       1 1.175573     484       4
3         0       0        0     1        0        0       0       0       0 1.098612       4       0
4         0       0        0     0        0        0       0       1       0 1.791759    1936     784
5         0       0        0     0        0        0       0       0       0 1.667707      49       4
6         0       0        0     0        0        1       1       0       0 2.169054      81      64
7         0       0        0     1        0        0       1       0       0 2.420368     225      49
8         0       0        0     0        0        0       1       0       0 1.609438      25       9
9         0       0        0     1        0        0       1       0       0 1.280934     676      16
10        0       0        0     0        0        0       1       0       0 2.900322     484     441
Time series data
Observations of a variable or several variables over time
For example, stock prices, money supply, consumer price index, gross domestic product, annual homicide rates, automobile sales, …
Time series observations are typically serially correlated
Ordering of observations conveys important information
Data frequency: daily, weekly, monthly, quarterly, annually, high frequency data
Typical features of time series: trends and seasonality
Typical applications: applied macroeconomics and finance
```r
# Time series data on minimum wages and related variables for Puerto Rico
library(gt)  # for pretty html-table plots
library(wooldridge)
data(prminwge)
gt(prminwge[1:20, c("year", "avgmin", "avgcov", "prunemp", "prgnp")])
```
year  avgmin  avgcov  prunemp   prgnp
1950   0.198   0.201     15.4   878.7
1951   0.209   0.207     16.0   925.0
1952   0.225   0.226     14.8  1015.9
1953   0.311   0.231     14.5  1081.3
1954   0.313   0.224     15.3  1104.4
1955   0.369   0.236     13.2  1138.5
1956   0.447   0.245     13.3  1185.1
1957   0.488   0.244     12.8  1221.8
1958   0.555   0.238     14.2  1258.4
1959   0.588   0.260     13.3  1363.6
1960   0.616   0.270     11.8  1473.2
1961   0.608   0.269     12.7  1562.8
1962   0.707   0.279     12.8  1683.9
1963   0.723   0.279     11.0  1820.7
1964   0.809   0.294     11.2  1916.8
1965   0.834   0.302     11.7  2083.0
1966   0.854   0.444     12.3  2223.2
1967   0.971   0.448     11.6  2328.4
1968   1.104   0.455     10.3  2455.3
1969   1.149   0.455     10.3  2684.0
Pooled cross sections
Two or more cross sections are combined in one data set
Cross sections are drawn independently of each other
Pooled cross sections are often used to evaluate policy changes
Example:
Evaluate effect of change in property taxes on house prices
Random sample of house prices for the year 1993
A new random sample of house prices for the year 1995
Compare before/after (1993: before reform, 1995: after reform)
Panel or longitudinal data
The same cross-sectional units are followed over time. Therefore, wide panels are basically pooled cross sections with the very same (typically many) units
Long panels are time series for several units (e.g., countries or counties)
Panel data have a cross-sectional and a time series dimension, so we need two ID variables
Panel data can be used to account for time-invariant unobservable factors
Panel data can also be used to model lagged responses
Example:
City crime statistics; each city is observed for several years
Time-invariant unobserved city characteristics may be modeled
Effect of police on crime rates may exhibit time lag
```r
# Panel data set on city crime statistics
library(wooldridge)
data(countymurders)
gt(countymurders[(countymurders$year >= 1990 & countymurders$countyid <= 1005),
                 c("countyid", "year", "murders", "popul",
                   "percblack", "percmale", "rpcpersinc")])
```
countyid  year  murders   popul  percblack  percmale  rpcpersinc
    1001  1990        1   34512   20.19000  40.46000    10975.24
    1001  1991        1   35024   20.27000  40.48000    11152.39
    1001  1992        1   35560   20.34000  40.51000    11263.97
    1001  1993        1   37027   20.48505  48.68339    11312.82
    1001  1994        1   38027   20.64849  48.71013    11541.15
    1001  1995        5   38957   20.87686  48.72552    11680.74
    1001  1996        7   40061   20.97551  48.70073    11852.76
    1003  1990        7   99200   13.01000  41.30000    11600.30
    1003  1991        3  102224   13.04000  41.37000    11854.09
    1003  1992        5  105344   13.07000  41.43000    12124.56
    1003  1993        7  111018   13.17624  48.69210    12645.61
    1003  1994        5  115266   13.28579  48.73163    13012.65
    1003  1995       13  119373   13.42347  48.82176    13327.95
    1003  1996        6  123023   13.49666  48.83233    13583.02
    1005  1990        4   25532   44.22000  39.38000     9997.83
    1005  1991        4   25728   44.44000  39.43000    10371.41
    1005  1992        0   25932   44.67000  39.44000    11039.38
    1005  1993        3   26461   45.28930  48.74721    10721.85
    1005  1994        3   26445   45.70240  49.01115    10912.72
    1005  1995        3   26337   46.00372  49.12860    10702.64
    1005  1996        1   26475   46.19075  49.15203    10760.51
1.2 Causality
Definition of causal effect of x on y: \ \ x \rightarrow y
How does variable y change if variable x is changed but all other relevant factors are held constant?
Most economic questions are ceteris paribus questions
It is useful to describe how an experiment would have to be designed to infer the causal effect in question (see examples below)
Simply establishing a relationship – correlation – between variables is not sufficient. Correlation alone says nothing about causality !!!
The question is whether an observed effect (correlation) between x and y can be considered causal. There are several possibilities:
x \rightarrow y
x \leftarrow y
x \leftrightarrows y
z_j \rightarrow x \text{ and } z_j \rightarrow y, \ \ldots
If we have controlled for enough other variables z_j, then the estimated ceteris paribus effect can often be considered to be causal (but not always, as not all variables are observable) 1
However, it is typically difficult to establish causality and we always need some identifying assumptions, which should be credible
Figure 1.1: Does carrying an umbrella in the morning cause rainfall in the afternoon? Which case?
Further examples
Causal effect of fertilizer on crop yield
“By how much will the production of soybeans increase if one increases the amount of fertilizer applied to the ground?”
Implicit assumption: all other factors z_j that influence crop yield such as quality of land, rainfall, presence of parasites etc. are held fixed
Experiment:
Choose several one-acre plots of land; randomly assign different amounts of fertilizer to the different plots; compare yields
Experiment works because amount of fertilizer applied is unrelated to other factors (including the original crop yield y) influencing crop yields
Measuring the return to education
“If a person is chosen from the population and given another year of education, by how much will his or her wage increase?”
Implicit assumption: all other factors z_j that influence wages such as experience, family background, intelligence etc. are held fixed
Experiment:
Choose a group of people; randomly assign different amounts of education to them (infeasible!); compare wage outcomes
Problem without random assignment: amount of education is related to other factors that influence wages (e.g., intelligence or diligence);
this is a (z_j \rightarrow x \text{ and } z_j \rightarrow y) – problem
Effect of law enforcement on city crime level
“If a city is randomly chosen and given ten additional police officers, by how much would its crime rate fall?”
Alternatively: “If two cities are the same in all respects, except that city A has ten more police officers than city B, by how much would the two cities' crime rates differ?”
Experiment:
Randomly assign number of police officers to a large number of cities
In reality, number of police officers will be determined by crime rate – simultaneous determination of crime and number of police;
this is mainly a x \leftrightarrows y – problem
Effect of the minimum wage on unemployment
“By how much (if at all) will unemployment increase if the minimum wage is increased by a certain amount (holding other things fixed)?”
Experiment:
Government randomly chooses the minimum wage each year and observes unemployment outcomes. The experiment works because the level of the minimum wage is then unrelated to other factors determining unemployment
In reality, the level of the minimum wage will depend on political and economic factors that also influence unemployment;
mainly a (z_j \rightarrow x \text{ and } z_j \rightarrow y) – problem
1.3 The Simple Regression Model
Definition of the simple linear regression model:
y = \beta_0 + \beta_1 x + u
\tag{1.1}
where
\ y … Dependent variable, explained variable, response variable or regressand
\ x … Independent variable, explanatory variable or regressor
\ \beta_0 … Intercept
\ \beta_1 … Slope parameter
\ u … Error term, disturbance, unobserved factors with E(u)=0, which is not restrictive because of \beta_0
This is a simple regression model, because we have only one explanatory variable.
Equation 1.1 describes what change in y we can expect if x changes. It follows:
\dfrac{dE(y \mid x)}{dx} \ = \ \beta_1 + \dfrac{dE(u \mid x)}{dx} \ = \ \beta_1, \quad \text{as long as} \quad \dfrac{dE(u \mid x)}{dx} = 0
Interpretation of \beta_1: By how much does the dependent variable change (on average, as u always varies in some way) if the independent variable is increased by one unit?
This interpretation is only correct if all other things (contained in u) remain (on average) constant when the independent variable x is increased by one unit!
Remark: The simple linear regression model is rarely applicable in practice but its discussion is useful for pedagogical reasons
Using a simple regression model we usually have a \ (z_j \rightarrow x \text{ and } z_j \rightarrow y) – problem rendering the causal interpretation of \beta_1 incorrect in most cases
1.3.1 Some Examples
A simple wage equation: wage = \beta_0 + \beta_1 educ + u
\beta_1 measures the change in hourly wage given another year of education, holding all other factors fixed
u represents labor force experience, tenure with current employer, work ethic, intelligence, etc.
Soybean yield and fertilizer: yield = \beta_0 + \beta_1 fertilizer + u
\beta_1 measures the effect of fertilizer on yield, holding all other factors fixed
u represents unobserved (or omitted) factors like rainfall, land quality, presence of parasites, etc.
1.3.2 Conditional mean independence assumption
When is a causal interpretation of Equation 1.1 justified?
Conditional mean independence assumption
E(u \, | \, x) = E(u) = 0
\tag{1.2}
The explanatory variable must not contain any information about the mean of the unobserved factors in u
So knowing something about x doesn't give us information about u
This leads to \frac {dE(u \mid x)}{dx}=0 as required. If this assumption is satisfied, we actually have a (x \rightarrow y) – case
Regarding the wage example, wage = \beta_0 + \beta_1 educ + u, ability is likely an important, but often unobserved, factor for the wage obtained by a particular individual. As ability is not an explicit variable in the model, it is contained within u
The conditional mean independence assumption is unlikely to hold in this case because individuals with more education will also be more capable on average. Knowing something about the education (variable x) of a particular individual therefore contains some information about the ability of that individual (which is in u)
Hence, E(u \, | \, x) \neq 0 is easily possible in this case
Basically, we have the (z_j \rightarrow x \text{ and } z_j \rightarrow y) – problem, with z_j being ability
Regarding the fertilizer example, a similar argument holds. Typically, a farmer uses more fertilizer if the quality of the soil is bad. Therefore, the quality of the soil, which is part of u, influences both the crop yield and the amount of fertilizer used. Hence, we once again have a (z_j \rightarrow x \text{ and } z_j \rightarrow y) – problem, with z_j being the quality of the soil
And furthermore, E(u \, | \, x) \neq 0, as the amount of fertilizer used (variable x) gives us information about the quality of the soil, which is part of u \; \Rightarrow \; the conditional mean independence assumption is probably violated in this case (the simulation sketch below illustrates this mechanism for the wage example)
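The following small simulation is not from the original text; all numbers are illustrative. It mimics the wage example: an unobserved variable "ability" drives both educ and wage, so the simple regression of wage on educ overstates the true causal effect:

```r
# Illustrative simulation of a violated conditional mean independence assumption:
# "ability" (the z_j) raises both educ (z_j -> x) and wage (z_j -> y)
set.seed(42)
n       <- 10000
ability <- rnorm(n)
educ    <- 12 + 2 * ability + rnorm(n)              # x depends on z_j
wage    <- 5 + 0.5 * educ + 2 * ability + rnorm(n)  # true beta_1 = 0.5
coef(lm(wage ~ educ))["educ"]            # around 1.3: biased upward
coef(lm(wage ~ educ + ability))["educ"]  # around 0.5: controlling for z_j helps
```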
1.3.3 Population regression function (PRF)
Taking the conditional expectation of Equation 1.1, we arrive at the so-called population (true) regression function
E(y \, | \, x) \ = \ E(\beta_0 + \beta_1 x + u \, | \, x) \ = \ \beta_0 + \beta_1 x + \underbrace{E(u \, | \, x)}_{= \, 0}
\tag{1.3}
Because of Equation 1.2, this implies
E(y \, | \, x) \ = \ \beta_0 + \beta_1 x
\tag{1.4}
This means that the average value of the dependent variable can be expressed as a linear function of the explanatory variable and Equation 1.4 is, in a certain sense, the best possible predictor of y, given the information x and assumption Equation 1.2
Furthermore, \beta_1 = \dfrac{dE(y|x)}{dx}. That means that a one-unit increase of x changes the conditional expected value (the average) of y by the amount \beta_1 (if the conditional mean independence assumption is met)
For a given value of x, the distribution of y is centered around E(y|x), as illustrated in Figure 1.2, which shows a graphical representation of the population regression function
Figure 1.2: Population regression line; Source: Wooldridge (2019)
1.3.4 Estimation
In order to estimate the regression model one needs data, i.e., a random sample of n observations (y_i, x_i), \ i=1, \ldots , n
The task is: fit a regression line through the data points as well as possible; this line is an estimate of the PRF:
\hat y_i = \hat \beta_0 + \hat \beta_1 x_i
\tag{1.5}
Figure 1.3: Estimated regression line; Source: Wooldridge (2019)
Principle of ordinary least squares – OLS
What does "as well as possible" mean? We define the regression residuals \hat u_i as (a hat always denotes an estimated value)
\hat u_i \ \equiv \ y_i - \hat y_i \ = \ y_i - \hat \beta_0 - \hat \beta_1 x_i
\tag{1.6}
We choose \hat \beta_0 and \hat \beta_1 so as to minimize the sum of squared regression residuals
\underset{\hat \beta_0, \hat \beta_1}{\operatorname{min}} \ \sum_{i=1}^n \hat u_i^2
\tag{1.7}
The resulting first order conditions are
\sum_{i=1}^n -2 \, (y_i - \hat \beta_0 - \hat \beta_1 x_i) \overset{!}{=} 0, \qquad \sum_{i=1}^n -2 \, x_i (y_i - \hat \beta_0 - \hat \beta_1 x_i) \overset{!}{=} 0
From these first order conditions we immediately arrive at the so-called Normal Equations, which are two linear equations in the two unknowns \hat \beta_0 and \hat \beta_1:
\sum_{i=1}^n (y_i - \hat \beta_0 - \hat \beta_1 x_i) = 0
\tag{1.8}
\sum_{i=1}^n x_i (y_i - \hat \beta_0 - \hat \beta_1 x_i) = 0
\tag{1.9}
Dividing the first normal equation, Equation 1.8, by n yields
\hat \beta_0 = \bar y - \hat \beta_1 \bar x
\tag{1.10}
Inserting Equation 1.10 into the second normal equation, Equation 1.9, and collecting terms leads to the OLS formula for the slope parameter
\hat \beta_1 \, = \, \dfrac{\frac{1}{n}\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2}
\tag{1.11}
This equals the sample covariance of y and x divided by the sample variance of x; it is only defined if there is some variation in x. After having calculated \hat \beta_1, we obtain \hat \beta_0 from Equation 1.10
Algebraic properties of OLS
The first normal equation, Equation 1.8, implies:
The regression line always passes through the sample midpoint (\bar x, \bar y), according to Equation 1.10
The sum (and average) of the residuals is zero: \sum_{i=1}^n \hat u_i = 0 according to Equation 1.8 and the definition in Equation 1.6
Furthermore, the second normal equation, Equation 1.9, together with the definition of the residuals in Equation 1.6, implies:
The regressor x_i and the regression residuals \hat u_i are orthogonal:
\sum_{i=1}^n x_i \hat u_i=0
i.e., are uncorrelated
This is the extremely important orthogonality property of OLS
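As a quick numerical check of the formulas above (a sketch using the wage1 data introduced earlier), the slope can be computed as the sample covariance divided by the sample variance, and the intercept from Equation 1.10:

```r
# OLS "by hand": slope = sample covariance / sample variance (Equation 1.11),
# intercept from Equation 1.10; both match the output of lm()
library(wooldridge)
data(wage1)
b1 <- cov(wage1$wage, wage1$educ) / var(wage1$educ)
b0 <- mean(wage1$wage) - b1 * mean(wage1$educ)
c(b0 = b0, b1 = b1)
coef(lm(wage ~ educ, data = wage1))
```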
Estimation by the Method of Moments
Another approach for estimating the (true) population parameters \beta_0 and \beta_1 is the method of moments procedure, MoM
The basis for this is the conditional mean independence assumption, Equation 1.2: E(u \, | \, x) = E(u) = 0. This implies that the covariance between u and x is zero:
\operatorname{Cov}(x,u) \ = \ E[(x - E(x))(u - 0)] \ = \ E(x u) - E(x) \underbrace{E(u)}_{0} \ = \ E(x u)
and, by the law of iterated expectations, E(x u) = E_x[x \underbrace{E(u \, | \, x)}_{0}] = 0. Hence, we have two (population) moment restrictions:
E(u) \ = \ E(y - \beta_0 - \beta_1 x) = 0
\tag{1.12}
E(x u) \ = \ E[x (y - \beta_0 - \beta_1 x)] = 0
\tag{1.13}
The method of moments approach to estimating the parameters imposes these two population moment restrictions on the sample data
In particular: the population moments are replaced by their sample counterparts
The justification is as follows: By the Law of Large Numbers, LLN, the sample moments converge to their population/theoretical counterparts under rather weak assumptions (stationarity, weak dependence). E.g., with increasing sample size n the sample mean of a random variable converges to the expectation of this random variable (compare Theorem A.2)
So we can estimate the population moments by the corresponding empirical moments. In particular, we estimate the expectation, E(y), with the arithmetic sample mean \bar y, knowing that by the LLN this sample estimator converges to E(y) with increasing sample size
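A tiny illustration of this convergence (a toy example, not from the original text):

```r
# Sample means of N(3, 1) draws approach the population mean 3 as n grows
set.seed(1)
sapply(c(10, 1000, 100000), function(n) mean(rnorm(n, mean = 3)))
```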
Hence, the population moment conditions, Equation 1.12 and Equation 1.13, can be replaced (estimated) by their corresponding sample means:
\frac{1}{n} \sum_{i=1}^n (y_i - \hat \beta_0 - \hat \beta_1 x_i) = 0
\frac{1}{n} \sum_{i=1}^n x_i \, (y_i - \hat \beta_0 - \hat \beta_1 x_i) = 0
However, the above conditions (which the parameters \beta_0 and \beta_1 have to meet) are exactly the same as the first order conditions from minimizing the sum of squared residuals, the normal equations, Equation 1.8 and Equation 1.9, and therefore yield the same solutions.
Hence, OLS and MoM estimation yield the very same estimated parameters \hat \beta_0 and \hat \beta_1 in this case. (For an additional analysis of MoM estimation, see Section 2.4.1)
Furthermore, the OLS estimator is also equal to the maximum likelihood estimator, ML, assuming normally distributed error terms
Maximum likelihood estimation is treated in more detail in Section 10.2. Intuitively, ML means that – for a given sample – the estimated parameters are chosen such that the probability of obtaining the respective sample is maximized
Under standard assumptions, OLS, MoM and ML estimators are equivalent (but generally, they can be different!)
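A minimal sketch of this equivalence on simulated data (illustrative; the variable names and true parameter values are chosen freely): numerically maximizing the normal log-likelihood reproduces the lm() estimates up to numerical tolerance:

```r
# ML for the simple regression with normal errors, via optim();
# parameter vector p = (beta0, beta1, log sigma)
set.seed(1)
x <- rnorm(100)
y <- 1 + 0.5 * x + rnorm(100)
negll <- function(p) {
  # negative log-likelihood; exp(p[3]) keeps the error sd positive
  -sum(dnorm(y, mean = p[1] + p[2] * x, sd = exp(p[3]), log = TRUE))
}
ml <- optim(c(0, 0, 0), negll)
rbind(ML = ml$par[1:2], OLS = unname(coef(lm(y ~ x))))
```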
1.3.5 An example in R
Install R from https://www.r-project.org
Install RStudio from https://rstudio.com/products/rstudio/download/#download
Start RStudio and install the packages AER and Wooldridge (which we will need very often). For that purpose go to the lower right window, choose the tab Packages, then the tab Install and enter AER and then click Install. If you are asked during the installation whether you want to compile code, type: no (in the lower left window). Repeat the same for the package Wooldridge
To input code, use the upper left window. To execute code, mark the code in the upper left window and click on the tab Run at the top of the upper left window
You will see the results in the lower left window
To run the examples from these slides, simply copy the code from the slides (shaded in grey) into the upper left window, mark it and run it
We want to investigate to what extent the success in an election is determined by the expenditures during the campaign.
```r
# We use a data set contained in the "Wooldridge" package
# We already installed this package, however, if we want to use it in R,
# we additionally have to load it with the library() command
library(wooldridge)
# Loading the data set "vote1" from the Wooldridge package with the "data" command
data(vote1)
# Printing out the first 6 observations of the data set "vote1" with the command "head()"
head(vote1)
```
Running a regression of voteA on shareA with the command lm() (for linear model)
```r
out <- lm(voteA ~ shareA, data = vote1)
# We stored the results in a list with the freely chosen name "out"
# With coef(out) we print out the estimated coefficients
# Try to interpret the estimated coefficients
coef(out)
```
(Intercept) shareA
26.8122141 0.4638269
```r
# With fitted(out) we store the fitted values
yhat <- fitted(out)
# With residuals(out) we store the residuals
uhat <- residuals(out)
```
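To see what the fitted line implies, we can predict the outcome for a chosen value of shareA (the value 50 is purely illustrative, not from the original text):

```r
# Predicted vote share for a candidate with 50% of the campaign expenditures:
# 26.81 + 0.464 * 50, i.e., about 50% of the votes
predict(out, newdata = data.frame(shareA = 50))
```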
Checking the orthogonality property of OLS – the correlation between the explanatory variable x and the residuals \hat u:
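The check itself, using the residuals uhat stored above (the correlation is zero up to floating-point error):

```r
# Correlation between the regressor and the OLS residuals: zero by construction
round(cor(uhat, vote1$shareA), digits = 14)
```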
This simple model for the success in an election seems very plausible; however, it suffers from a very common problem
In this particular example, the conditional mean independence assumption is almost certainly violated. Why?
Because the campaign expenditures strongly depend on donations from supporters. The stronger a candidate is in a particular district, the more donations he or she will get, and the higher the potential campaign expenditures will be
Hence, we have a reverse causality problem here, \ x \leftrightarrows y, or a third-variable problem, z_j \rightarrow x \text{ and } z_j \rightarrow y, both of which lead to E(u|x) \neq 0 in general
This will probably lead to a strong overestimation of the effect of campaign expenditures on votes in this particular case
Note that although x is very likely correlated with unobserved factors in u, the example above showed that the correlation between x and the sample residuals \hat u is zero – the orthogonality property of OLS. Hence, this fact says nothing about whether the conditional mean independence assumption is satisfied or not
A possible remedy: a multiple regression model (with the variables z_j included in the set of explanatory variables) or trying to identify the x \rightarrow y relationship with external information (like instrumental variables; we will deal with this approach in Chapter 7)
1.3.6 Measures of Goodness-of-Fit
How well does the explanatory variable explain the dependent variable?
Measures of variation:
SST = \sum\nolimits_{i=1}^n (y_i - \bar y)^2, \quad SSE = \sum\nolimits_{i=1}^n (\hat y_i - \bar y)^2, \quad SSR = \sum\nolimits_{i=1}^n \hat u_i^2
SST is the total sum of squares and represents the total variation in the dependent variable; SSE is the explained sum of squares, the variation explained by the regression; SSR is the residual sum of squares, the variation not explained by the regression
Decomposition of total variation (because of y_i = \hat y_i + \hat u_i, \sum_i x_i \hat u_i = 0 and \sum_i \hat u_i = 0):
SST = SSE + SSR
Goodness-of-fit measure:
R^2 \ \equiv \ \dfrac{SSE}{SST} \ = \ 1 - \dfrac{SSR}{SST}
The R-squared measures the fraction of the total variation in y that is explained by the regression
Example
```r
# Running once more the regression of voteA on shareA with the command lm()
out <- lm(voteA ~ shareA, data = vote1)
# Printing a summary of the regression
summary(out)
```
Call:
lm(formula = voteA ~ shareA, data = vote1)
Residuals:
Min 1Q Median 3Q Max
-16.8919 -4.0660 -0.1682 3.4965 29.9772
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.81221 0.88721 30.22 <2e-16 ***
shareA 0.46383 0.01454 31.90 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.385 on 171 degrees of freedom
Multiple R-squared: 0.8561, Adjusted R-squared: 0.8553
F-statistic: 1018 on 1 and 171 DF, p-value: < 2.2e-16
# Caution: A high R-squared does not mean that the regression has a causal interpretation!
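To connect the output with the definition above, the R-squared can also be computed by hand from the stored residuals (a small check; it matches the Multiple R-squared reported by summary()):

```r
# R-squared from its definition: 1 - SSR/SST
SST <- sum((vote1$voteA - mean(vote1$voteA))^2)
SSR <- sum(uhat^2)
1 - SSR / SST
```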
1.3.7 Statistical Properties of OLS
The OLS parameter estimates (estimated coefficients) are functions of random variables and thus random variables themselves
We are interested in the moments and the distribution of the estimated coefficients, especially in the expectations and variances
Three questions are of particular interest:
Are the OLS estimates unbiased, i.e., E(\hat \beta_i) = \beta_i \, ?
How precise are our parameter estimates, i.e., how large is their variance \operatorname {Var}(\hat \beta_i) \; ?
How are the estimated OLS coefficients distributed?
Unbiasedness of OLS
Theorem 1.1 (Unbiasedness of OLS) Given a random sample and conditional mean independence of u_i from x, we have:
E(\hat \beta_0) = \beta_0, \ \ E(\hat \beta_1) = \beta_1
Interpretation of unbiasedness: the estimated coefficients may be smaller or larger than the true values, depending on the sample, which is the result of a random draw. However, on average they will be equal to the true values (on average means with regard to repeated samples). In a given sample, estimates may differ considerably from the true values
Variances of the OLS estimates
Depending on the sample, the estimates will be nearer to or farther away from the true values. How far can we expect our estimates to be away from the true population values on average? To answer this we need an additional assumption
Homoskedasticity assumption:
\operatorname{Var}(u \, | \, x) = \sigma^2
The values of the explanatory variable must not contain any information about the variability of the unobserved factors
Together with the conditional mean independence assumption this furthermore implies that the conditional variance of u is also equal to the unconditional variance of u: \operatorname{Var}(u \, | \, x) = \operatorname{Var}(u) = \sigma^2
Under these assumptions, the sampling variance of the slope estimator is
\operatorname{Var}(\hat \beta_1 \, | \, x_1, \ldots, x_n) \ = \ \dfrac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}
The error variance \sigma^2 is unknown and has to be estimated by the average of the squared residuals:
\hat \sigma^2 \ = \ \dfrac{1}{n-2} \sum_{i=1}^n \hat u_i^2
This estimator turns out to be unbiased under our assumptions (see Theorem 2.1)
Note that we divide by (n-2) and not by n to calculate the average above. The reason is that, for calculating the \hat u_i, we first need to estimate the two parameters \beta_0 and \beta_1. Knowing these two estimated parameters, only (n-2) of the residuals are informative: the two normal equations impose two restrictions on the residuals, so from the estimated parameters together with (n-2) residuals we could infer the remaining two. Therefore, these last two contain no additional information
The number (n-2), which is the number of observations minus the number of estimated model parameters, is referred to as degrees of freedom
Standard errors for regression coefficients
Having an estimate \hat \sigma^2 for \sigma^2 – its square root \hat \sigma is called the standard error of the regression, S.E.R. – we are able to estimate the standard errors of the parameter estimates
Calculation of standard errors for regression coefficients
The estimated standard deviations of the regression coefficients are called standard errors. They measure how precisely the regression coefficients are estimated. For the slope parameter, replacing \sigma by \hat \sigma in the variance formula above and taking the square root gives
\operatorname{se}(\hat \beta_1) \ = \ \dfrac{\hat \sigma}{\sqrt{\sum_{i=1}^n (x_i - \bar x)^2}}
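For the voting example, these quantities can be computed by hand and compared with the summary() output shown earlier (a small check using the objects out and uhat created above):

```r
# sigma-hat and standard errors "by hand" for the voting example
n <- nobs(out)                          # 173 observations
sigma_hat <- sqrt(sum(uhat^2) / (n - 2))
sigma_hat                               # about 6.385, the residual standard error
se_b1 <- sigma_hat / sqrt(sum((vote1$shareA - mean(vote1$shareA))^2))
se_b1                                   # about 0.01454, the S.E. of shareA
sqrt(diag(vcov(out)))                   # both standard errors via vcov()
```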
The following figures should illustrate the theoretical concepts discussed above
```r
## Monte Carlo simulation for regressions with one explanatory variable ##

#### definition of function ####
sims <- function(n = 120, rep = 5000, sigx = 1, sig = 1) {
  set.seed(13468)   # seed for random number generator
  # true parameters
  B0 <- 0
  B1 <- 0.5
  # initializing lists for storing results
  OLS  <- vector(mode = "list", length = rep)
  OLS1 <- vector(mode = "list", length = rep)

  #### rep loop: draw samples and estimate ####
  for (i in (1:rep)) {
    x  <- rnorm(n, mean = 0, sd = sigx)
    u  <- rnorm(n, mean = 0, sd = sig)
    u1 <- u / 2                       # error with half the standard deviation
    minx <- min(x); maxx <- max(x)
    y  <- B0 + B1 * x + u
    y1 <- B0 + B1 * x + u1
    miny <- min(y); maxy <- max(y)
    OLS[[i]]  <- lm(y ~ x, model = FALSE)
    OLS1[[i]] <- lm(y1 ~ x, model = FALSE)
  }

  #### drawing plots ####
  # scatterplot with true and estimated reg-line for the last regression
  plot(y ~ x, col = "blue")
  abline(OLS[[i]], col = "blue")
  abline(c(B0, B1), col = "red")
  # rep > 100: histogram of the estimated parameter b1
  if (rep > 100) {
    b1_distribution <- sapply(OLS, function(x) coef(x)[2])
    hist(b1_distribution, breaks = 30, main = "")
    abline(v = B1, col = "red")
  }
  # true and up to 100 estimated reg-lines
  plot(NULL, xlim = c(minx * 1.1, maxx * 1.1), ylim = c(miny * 1.1, maxy * 1.1),
       ylab = "y", xlab = "x")
  for (i in 1:min(100, rep)) abline(OLS[[i]], col = "lightgrey")
  points(y ~ x, col = "blue")
  abline(c(B0, B1), col = "red")
  # true and up to 100 estimated reg-lines, smaller sig
  plot(NULL, xlim = c(minx * 1.1, maxx * 1.1), ylim = c(miny * 1.1, maxy * 1.1),
       ylab = "y1", xlab = "x")
  for (i in 1:min(100, rep)) abline(OLS1[[i]], col = "lightgrey")
  points(y1 ~ x, col = "blue")
  abline(c(B0, B1), col = "red")
}
#### end of function ####

# Calling function sims() with default values for the parameters
sims()
```
(a) Population regression function (red) and estimated regression function of a particular sample (blue), 120 observations, compare Figure 1.2 and Figure 1.3
(b) Unbiasedness: Histogram of 5000 estimates of \beta_1 based on random draws of u and x with 120 observations each. True value of \beta_1 is 0.5
(c) Variance of \hat \beta_1: 100 random samples with 120 observations each. PRF in red, estimated regression functions in grey
(d) Variance of \hat \beta_1 with smaller {\sigma} / {\sigma_x}: 100 random samples with 120 observations each. PRF in red, estimated regression functions in grey. Variance of \hat \beta_1 is much smaller
Figure 1.6: Population regression function (PRF), estimated regression functions, unbiasedness and variance of estimates
1.3.8 Example once more
We repeat the regression output from our voting example. Look for the new concepts we just discussed in the regression output shown below.
```r
library(modelsummary)
# Running once more the regression of voteA on shareA with the command lm()
out <- lm(voteA ~ shareA, data = vote1)
modelsummary(list("Vote for candidate A" = out),
             shape = term ~ statistic,
             statistic = c('std.error', 'statistic', 'p.value', 'conf.int'),
             stars = TRUE, gof_omit = "A|L|B|F",
             align = "ldddddd", output = "gt")
```
Vote for candidate A

              Est.       S.E.    t       p        2.5 %   97.5 %
(Intercept)   26.812***  0.887   30.221  <0.001   25.061  28.564
shareA         0.464***  0.015   31.901  <0.001    0.435   0.493
Num.Obs.      173
R2            0.856
RMSE          6.35

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Wooldridge, Jeffrey M. 2019. Introductory Econometrics: A Modern Approach. Seventh ed. Boston: Cengage.
If every variable z_j, which influences both x and y, is known and observable, x \leftrightarrows y reduces to a z_j \rightarrow x \text{ and } z_j \rightarrow y – problem.
---title-block-banner: truesubtitle: "Based on @wooldridge_Intro_Econometrics, Chaptes 1 and 2"---# Simple Regressions\pagebreak## Models and Data#### **What is econometrics?**- Econometrics = use of *statistical methods to analyze economic data* - Econometric methods are used in many other fields, like social science, medicine, ect.- Econometricians typically analyze **nonexperimental** data#### **Typical goals of econometric analysis**- Estimating relationships between economic variables- Testing economic theories and hypothesis- Forecasting economic variables- Evaluating and implementing government and business policy#### **Steps in econometric analysis**1) Economic model (this step is often skipped)2) Econometric model------------------------------------------------------------------------### Economic models- Micro- or macromodels, growth models, models of open economies, etc.- Often use optimizing behavior, equilibrium modeling, …- Establish relationships between economic variables- Examples: demand equations, pricing equations, Euler equations …#### Economic model of crime (Becker (1968))An equation for *criminal activity* is derived, based on **utility maximization** which results in$$y = f(x_1, x_2, \ldots , x_k)$$- *Dependent variable* - `y` = Hours spent in criminal activities- *Explanatory variables* $x_j$ - "Wage" of criminal activities - Wage for legal employment - Other income - Probability of getting caught - Probability of conviction if caught - Expected sentence - Family background - Talent for Crime, moral character- The functional form of the relationship is not specified- The equation above could have been postulated without economic modeling - But in this case, the model lacks a theoretical foundation - If we have a theoretical model, we can often derive the expected sign of the coefficients or even guess the magnitude - This can be compared to the estimated coefficients, and if the expectations are not met, we can search for a rationale------------------------------------------------------------------------#### Economic Model of job training and worker productivity- What is effect of additional training on worker productivity?- Formal economic theory not really needed to derive equation but is clearly possible:$$wage = f(educ, exper, \ldots , training)$$- *Dependent variable* - `wage` = hourly wage- *Explanatory variables* $x_j$ - `educ` = years of formal education - `exper` = years of work force experience - `training` = weeks spent in job training- Other factors may be relevant as well, but these are the most important (?)------------------------------------------------------------------------### Econometric models#### Econometric model of criminal activity- The functional form has to be specified- Variables may have to be approximated by other quantities (leading to *measurement error*s)$$crime = \beta_{0} + \beta_{1} { wage } + \beta_{2} { othinc } + \beta_{3} { freqarr } + \beta_{4} { freqconv } + \\ \beta_{5} { avgsen } + \beta_{6} { age } + u $$- `crime` ... measure of criminal activity- `wage` ... wage for legal employment- `othinc` ... other income- `freqarr` ... frequency of prior arrests- `freqcon` ...frequency of conviction- `avgsen` ... Average sentence length after conviction- `age` ... age- *u* ... error term, which contains **unobserved** factors (lack of data), like moral character, wage in criminal activity, family background, etc. 
Oddly enough, it is this error term, which attracts the most attention in econometrics------------------------------------------------------------------------#### Econometric model of job training and worker productivity$$wage = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 training + u$$- `wage` ... hourly wage- `educ` ... years in formal education- `exper` ... years of workforce experience- *training* ... weeks spent in job training- *u* ... error term representing **unobserved** determinants of the wage like innate ability, quality of education, family background$$\text{ }$$- As mentioned above, most of econometrics deals with the **specification of the error u**. As we will see, this is **essential** for a **causal interpretation** of the estimates- Econometric models may also be used for **hypothesis testing** - For example, the parameter $\beta_3$ represents the *effect of training on wages* - How large is this effect? Is it even different from zero?------------------------------------------------------------------------### Data- Econometric analysis requires data and there are different kinds of economic data sets - Cross-sectional data - Time series data - Pooled cross sections - Panel/Longitudinal data- Econometric methods depend on the nature of the data used - Different data sets lead to different estimation problems. Use of inappropriate methods may lead to misleading results- **Cross-sectional data sets** - Sample of individuals, households, firms, cities, states, countries or other units of interest at a given point of time/in a given period - Cross-sectional observations are more or less independent - For example, pure random sampling from a population - Sometimes pure random sampling is violated, e.g., units refuse to respond in surveys, or if sampling is characterized by clustering (this usually leads to *autocorrelation*, *heteroscedasticity* or *sample selection problems*) - Cross-sectional data are typically encountered in *applied microeconomics*------------------------------------------------------------------------```{r}#| comment: " "# Cross-sectional data set on wages and other characteristics. 
Look especially at indicator variableslibrary(wooldridge)data(wage1) head(wage1, 10)# orlibrary(gt) # for pretty html-table plotsgt(head(wage1,10))```------------------------------------------------------------------------- **Time series data** - Observations of a variable or several variables *over time* - For example, stock prices, money supply, consumer price index, gross domestic product, annual homicide rates, automobile sales, … - Time series observations are typically *serially correlated* - Ordering of observations conveys important information - Data frequency: daily, weekly, monthly, quarterly, annually, high frequency data - Typical features of time series: *trends* and *seasonality* - Typical applications: *applied macroeconomics and finance*------------------------------------------------------------------------```{r}# Time series data on minimum wages and related variables for Puerto Ricolibrary(gt) # for pretty html-table plotslibrary(wooldridge)data(prminwge)gt( prminwge[1:20, c("year", "avgmin", "avgcov", "prunemp", "prgnp")] )```------------------------------------------------------------------------- **Pooled cross sections** - Two or more cross sections are combined in one data set - Cross sections are drawn independently of each other - Pooled cross sections often used to *evaluate policy changes*- Example: - Evaluate effect of change in property taxes on house prices - Random sample of house prices for the year 1993 - A new random sample of house prices for the year 1995 - Compare before/after (1993: before reform, 1995: after reform)------------------------------------------------------------------------- **Panel or longitudinal data** - The **same** cross-sectional units are followed over time. Therefore, wide panels are basically pooled crossections with the very same units (which are many) - Long panels are time series for several units (e.g., countries or counties) - Panel data have a cross-sectional *and* a time series dimension. So we have two id-variables - Panel data can be used to account for *time-invariant unobservable factors* - Panel data can also be used to model lagged responses- Example: - City crime statistics; each city is observed for serveral years - Time-invariant unobserved city characteristics may be modeled - Effect of police on crime rates may exhibit time lag------------------------------------------------------------------------```{r}# Panel data set on city crime statisticslibrary(wooldridge)data(countymurders)gt( countymurders[ (countymurders$year >=1990& countymurders$countyid <=1005), c("countyid", "year", "murders", "popul", "percblack", "percmale", "rpcpersinc")] )```------------------------------------------------------------------------## CausalityDefinition of causal effect of $x$ on $y: \ \ x \rightarrow y$- **How does variable** $y$ **change if variable** $x$ **is changed but all other relevant factors are held constant** - Most economic questions are **ceteris paribus** questions - It is useful to describe how an *experiment* would have to be designed to infer the causal effect in question (see examples below)> **Simply establishing a relationship -- correlation -- between variables is not sufficient. Correlation alone says nothing about causality !!!**- The question is, whether a found effect (correlation) between $x$ and $y$ can be considered as **causal**. 
There are several possibilities: - $x \rightarrow y$ - $x \leftarrow y$ - $x \leftrightarrows y$ - $z_j \rightarrow x \text{ and } z_j \rightarrow y, \ \ldots$- If we have controlled for enough other variables $z_j$, then the estimated ceteris paribus effect can often be considered to be causal (but not always, as not all variables are observable) [^the-nature-of-econometrics-1]- However, it is typically difficult to establish causality and we **always** need some **identifying assumptions**, which should be credible[^the-nature-of-econometrics-1]: If **every** variable $z_j$, which influences both $x$ and $y$ is known and observable, $x \leftrightarrows y$ reduces to a $z_j \rightarrow x \text{ and } z_j \rightarrow y$ -- problem.------------------------------------------------------------------------### Some Examples#### **"Post hoc, ergo propter hoc" fallacy** [^the-nature-of-econometrics-2][^the-nature-of-econometrics-2]: Lat.: Danach, also deswegen.```{r, echo=FALSE, fig.align='center', fig.width=7.2, fig.asp=0.72}#| label: fig-causality#| fig-cap: "Does carring an umbrellar in the morning causes rainfall in the afternoon? What case?"# spurious regressionset.seed(12)umbrella <-runif(60, 0, 100)raining <-pmax( 0.5* umbrella +rnorm(60, 0, 9), 0 ) /12out <-lm(raining ~ umbrella)corr <-round( cor(raining,umbrella), digits =2)plot(raining ~ umbrella, xlim=c(0,100), ylim=c(0,5), xlab="Share of people who carry an umbrellar in the morning of a particular day [%]",ylab="Actuall raining in the afternoon of that day [l/sqm]")text(20, 4, labels =bquote( "Correlation ="~ .(corr) ) )abline(out)```------------------------------------------------------------------------#### **Further examples****Causal effect of fertilizer on crop yield**- “By how much will the production of soybeans increase if one increases the amount of fertilizer applied to the ground”\- Implicit assumption: all other factors $z_j$ that influence crop yield such as quality of land, rainfall, presence of parasites etc. are held fixed*Experiment*:- Choose several one-acre plots of land; *randomly* assign different amounts of fertilizer to the different plots; compare yields\- Experiment works because amount of fertilizer applied is *unrelated* to other factors (including the original crop yield $y$) influencing crop yields------------------------------------------------------------------------**Measuring the return to education**- “If a person is chosen from the population and given another year of education, by how much will his or her wage increase?”- Implicit assumption: all other factors $z_j$ that influence wages such as experience, family background, intelligence etc. 
are held fixed*Experiment*:- Choose a group of people; *randomly* assign different amounts of education to them (infeasible!); compare wage outcomes- Problem without random assignment: amount of education is related to other factors that influence wages (e.g., intelligence or diligence);\ this is a $(z_j \rightarrow x \text{ and } z_j \rightarrow y)$ -- problem------------------------------------------------------------------------**Effect of law enforcement on city crime level**- “If a city is randomly chosen and given ten additional police officers, by how much would its crime rate fall?”- Alternatively: “If two cities are the same in all respects, except that city A has ten more police officers than city B, by how much would the two cities‘ crime rates differ?”Experiment:- *Randomly* assign number of police officers to a large number of cities- In reality, number of police officers will be determined by crime rate -- simultaneous determination of crime and number of police;\ this is mainly a $x \leftrightarrows y$ -- problem------------------------------------------------------------------------**Effect of the minimum wage on unemployment**- “By how much (if at all) will unemployment increase if the minimum wage is increased by a certain amount (holding other things fixed)?”Experiment:- Government *randomly* chooses minimum wage each year and observes unemployment outcomes. The experiment will work because level of minimum wage is unrelated to other factors determining unemployment- In reality, the level of the minimum wage will depend on political and economic factors that also influence unemployment;\ mainly a $(z_j \rightarrow x \text{ and } z_j \rightarrow y)$ -- problem------------------------------------------------------------------------## The Simple Regression Model##### Definition of the simple linear regression model:$$y = \beta_0 + \beta_1 x + u $$ {#eq-1b.1}- Thereby - $\ y$ ... Dependent variable, explained variable, response variable or regressand\ - $\ x$ ... Independent variable, explanatory variable or regressor\ - $\ \beta_0$ ... Intercept\ - $\ \beta_1$ ... Slope parameter\ - $\ u$ ... Error term, disturbance, unobserved factors with $E(u)=0$, which is *not restrictive* because of $\beta_0$This is a simple regression model, because we have **only one** explanatory variable.- @eq-1b.1 describes what change in $y$ we can expect if $x$ changes. If follows: $$\dfrac {dE(y|x)}{dx} \ = \ \beta_1 + \dfrac {dE(u|x)}{dx} \ = \ \beta_1$$ as long as $\dfrac {dE(u|x)}{dx} = 0$- **Interpretation** of $\beta_1$: By how much does the dependent variable change (on average, as $u$ always vary in some way) if the independent variable is increased by one unit? 
- This interpretation is only correct if *all other things (contained in u) remain (on average) constant* when the independent variable $x$ is increased by one unit!**Remark**: The *simple* linear regression model is *rarely applicable* in practice but its discussion is useful for pedagogical reasons- Using a simple regression model we usually have a $\ (z_j \rightarrow x \text{ and } z_j \rightarrow y)$ -- problem rendering the causal interpretation of $\beta_1$ incorrect in most cases------------------------------------------------------------------------### Some Examples$$\text{ }$$ - **A simple wage equation**: $$wage = \beta_0 + \beta_1 educ + u$$ - $\beta_1$ measures the change in hourly wage given another year of education, holding all other factors fixed - $u$ represents labor force experience, tenure with current employer, work ethic, intelligence, etc. $$\text{ }$$ - **Soybean yield and fertilizer:** $$yield = \beta_0 + \beta_1 fertilizer + u$$ - $\beta_1$ measures the effect of fertilizer on yield, holding all other factors fixed - $u$ represents unobserved (or omitted) factors like Rainfall, land quality, presence of parasites, etc. ------------------------------------------------------------------------### Conditional mean independence assumption {#sec-causal_interpretation}When is a **causal** interpretation of @eq-1b.1 justified?- **Conditional mean independence assumption**$$E(u \, | \, x) = E(u) = 0 $$ {#eq-1b.2}- The explanatory variable must not contain any information about the mean of the unobserved factors in $u$ - So knowing something about $x$ doesn‘t give us information about $u$ - This leads to $\frac {dE(u \mid x)}{dx}=0$ as required. If this assumption is satisfied, we actually have a $(x \rightarrow y)$ -- case- Regarding the **wage example** $$wage = \beta_0 + \beta_1 educ + u$$ *ability* is likely an important, but often *unobserved* factor for the obtained wage of a particular individual. As ability is not an *explicit variable* in the model, it is contained within $u$ - The conditional mean independence assumption is unlikely to hold in this case because individuals with more education will also be more capable on average. Knowing something about the education (variable $x$) of a particular individual therefore contains some information about the ability of that individual (which is in $u$) - Hence, $E(u \, | \, x) \neq 0$ easily possible in this case - Basically, we have the $(z_j \rightarrow x \text{ and } z_j \rightarrow y)$ -- problem, with $z_j$ being ability- Regarding the **fertilizer example** a similar argument holds. Typically, a framer uses more fertilizer if the quality of the soil is bad. Therefore, quality of the soil, which is part of $u$, influences both crop yield and the amount of fertilizer used. 
Hence, we once again have a $(z_j \rightarrow x \text{ and } z_j \rightarrow y)$ -- problem, with $z_j$ being quality of soil - And furthermore, $E(u \, | \, x) \neq 0$, as the amount of used fertlizer (variable $x$) gives as information about the quality of soil, which is part of $u$ $\; \Rightarrow \;$ conditional mean independence assumption is probably violated in this case*** ### Population regression function (PRF)Taking the conditional expectation of @eq-1b.1 we arrive to the so called **population (true) regression function**$$E(y \, | \, x) \ = \ E(\beta_0 + \beta_1 x + u \, | \, x) \ = \ \beta_0 + \beta_1 x + \underbrace {E(u \, | \, x)}_{= \, 0} $$ {#eq-PRF}Because of @eq-1b.2, this implies$$E(y \, | \, x) \ = \ \beta_0 + \beta_1 x $$ {#eq-1b.4}- This means that the *average value* of the dependent variable can be expressed as a linear function of the explanatory variable and @eq-1b.4 is, in a certain sense, the best possible predictor of $y$, given the information $x$ and assumption @eq-1b.2- Furthermore, $$\beta_1 = \dfrac {dE(y|x)}{dx}$$ That means that a one-unit increase of $x$ changes the conditional expected value (the average) of $y$ by the amount of $\beta_1$ (if the conditional mean independence assumption is met) - For a given value of $x$, the distribution of $y$ is centered around $E(y|x)$, as illustrated by in @fig-fig2 which shows a graphical representation of the population regression function```{r, echo=FALSE, message=FALSE, echo=FALSE}#| fig.align: center#| fig.cap: "Population regression line; Source: @wooldridge_Intro_Econometrics"#| label: fig-fig2 library(magick)img =image_read("popregline.png")image_scale(img, "530")```*** ### Estimation- In order to estimate the regression model one needs data, i.e., a *random sample* of $n$ observations $(y_i, x_i), \ i=1, \ldots , n$- The task is: Fit **as good as possible** a *regression line* through the data points which is an **estimation** of the PRF:$$\hat y_i = \hat \beta_0 + \hat \beta_1 x_i $$ {#eq-PRF_estimate} - The following @fig-fig3 gives an illustration of this problem ```{r, echo=FALSE, message=FALSE, echo=FALSE}#| fig.align: center#| fig.cap: "Estimated regression line; Source: @wooldridge_Intro_Econometrics"#| label: fig-fig3 library(magick)img =image_read("residuals.png")image_scale(img, "520")#image_trim(img)```------------------------------------------------------------------------#### **Principle of ordinary least squares -- OLS**What does “*as good as possible*” mean?- We define the regression residuals $\hat u_i$ as (note, a hat, "\^", always denotes an estimated value)$$\hat u_i \ \equiv \ y_i - \hat y_i \ = \ y_i - \underbrace {\hat \beta_0 - \hat \beta_1 x_i}_{\hat y_i} $$ {#eq-1b.6}- We choose $\hat \beta_0$ and $\hat \beta_1$ so as to minimize the sum of squared regression residuals$$\underset {\hat \beta_0, \hat \beta_1} {\operatorname {min}} \ \sum_{i=1}^n \hat u_i^2 \ \ \rightarrow \ \ \hat \beta_0, \, \hat \beta_1 $$ {#eq-1b.7}- The resulting **first order conditions** are$$\dfrac {\partial}{\partial \hat \beta_0}\sum_{i=1}^n \hat u_i^2 \ = \ \dfrac {\partial}{\partial \hat \beta_0}\sum_{i=1}^n (y_i - \hat \beta_0 - \hat \beta_1 x_i)^2 \ =$$$$\quad \quad \quad \sum_{i=1}^n -2 (y_i - \hat \beta_0 - \hat \beta_1 x_i) \overset {!}{=} 0$$ $$\dfrac {\partial}{\partial \hat \beta_1}\sum_{i=1}^n \hat u_i^2 \ = \ \dfrac {\partial}{\partial \hat \beta_1}\sum_{i=1}^n (y_i - \hat \beta_0 - \hat \beta_1 x_i)^2 \ =$$$$\quad \quad \quad \quad \sum_{i=1}^n - 2x_i (y_i - \hat \beta_0 - \hat \beta_1 x_i) 
\overset {!}{=} 0 $$ *** From these *first order conditions* above we immediately arrive the so called **Normal Equations**, which are two linear equations in the two variables $\hat \beta_0$ and $\hat \beta_1$$$\sum_{i=1}^n (\underbrace {y_i - \hat \beta_0 - \hat \beta_1 x_i}_{\hat u_i})= 0 $$ {#eq-1b.8}$$\sum_{i=1}^n x_i ( {y_i - \hat \beta_0 - \hat \beta_1 x_i}) = 0 $$ {#eq-1b.9}- Dividing by $n$ we get from the first normal @eq-1b.8$$\frac {1}{n} \sum_{i=1}^n y_i - \hat \beta_0 - \hat \beta_1 \frac {1}{n}\sum_{i=1}^n x_i = 0$$- This imply$$\bar y = \hat \beta_0 + \hat \beta_1 \bar x \ \ \Rightarrow \ \ \hat \beta_0 = \bar y - \hat \beta_1 \bar x $$ {#eq-1b.10}*** For calculating the **slope parameter** $\beta_1$ we insert @eq-1b.10 into the **second normal equation**, @eq-1b.9$$\sum_{i=1}^n x_i (y_i - \underbrace {(\bar y - \hat \beta_1 \bar x)}_{\hat \beta_0} - \hat \beta_1 x_i) = 0$$- Dividing by $n$ and expanding the sum leads to$$\frac {1}{n} \sum_{i=1}^n x_i y_i - \bar y \frac {1}{n} \sum_{i=1}^n x_i + \hat \beta_1 \bar x \frac {1}{n} \sum_{i=1}^n x_i - \hat \beta_1 \frac {1}{n} \sum_{i=1}^n x_i^2 = 0 \ \ \Rightarrow $$$$\frac {1}{n} \sum_{i=1}^n x_i y_i - \bar y \bar x + \hat \beta_1 \bar x^2 - \hat \beta_1 \frac {1}{n} \sum_{i=1}^n x_i^2 = 0$$- Collecting terms by applying the "Steinerschen Verschiebungssatz" we get$$\frac {1}{n} \sum_{i=1}^n (x_i - \bar x) (y_i - \bar y) - \hat \beta_1 \frac {1}{n} \sum_{i=1}^n (x_i - \bar x)^2 = 0$$- This immediately leads to the **OLS formula for the slope parameter**::: {.callout-important appearance="simple" icon="false"}$$\hat \beta_1 \, = \, \dfrac { \frac{1}{n} \sum_{i=1}^n (x_i - \bar x) (y_i - \bar y)}{\frac {1}{n} \sum_{i=1}^n (x_i - \bar x)^2 } \, = \, \dfrac { \frac{1}{n} \sum_{i=1}^n (x_i - \bar x) y_i}{\frac {1}{n} \sum_{i=1}^n (x_i - \bar x)^2 } $$ {#eq-beta1_hat}:::> This equals the **sample covariance** of $y$ and $x$ divided by the **sample variance** of $x$Formula @eq-beta1_hat is only defined if there is some variation in the explanatory variable $x$, i.e., the sample variance of x must not be zeroAfter having calculated $\hat \beta_1$ by the formula in @eq-beta1_hat we get $\hat \beta_0$ by inserting $\hat \beta_1$ into formula @eq-1b.10------------------------------------------------------------------------#### **Algebraic properties of OLS** {#sec-properties}The **first normal equation**, @eq-1b.8, imply:1. Regression line always passes through the sample midpoint $(\bar x, \bar y)$, according @eq-1b.102. The sum (and average) of the residuals is zero: $\sum_{i=1}^n \hat u_i = 0$ according to @eq-1b.8 and the definition in @eq-1b.6Furthermore, the **second normal equation**, @eq-1b.9 together with the definition of the residuals @eq-1b.6 implies:3. 
The regressor $x_i$ and the regression residuals $\hat u_i$ are *orthogonal*:\ $$ \sum_{i=1}^n x_i \hat u_i=0 $$ i.e., are **uncorrelated**> This is the extreme important **orthogonal property of OLS**------------------------------------------------------------------------#### **Estimation by Methods of Moments**Another approach for estimating the (true) population parameters $\beta_0$ and $\beta_1$ is the **method of moments** procedure, **MoM**- The basis for this is the **conditional mean independence assumption**, @eq-1b.2 $$E(u \, | \, x) = E(u) = 0$$ This implies that the covariance between $u$ and $x$ is zero:$$\operatorname {Cov}(x,u) \ = \ E \left[ (x-E(x)) \, (u-0) \right] \ = $$ $$E(x \, u) - E(x) \underbrace {E(u)}_0 \ = \ E(x \, u) \quad \Rightarrow $$ $$E(x \, u) = E_x [x \underbrace {E(u | x)}_0 ] = 0 $$- Hence, we have two (population) moment restrictions$$E(u) \ = \ E(\underbrace {y-\beta_0-\beta_1 x)}_u=0 $$ {#eq-1b.12}$$E(x \, u) \ = \ E[x \, (y-\beta_0-\beta_1 x)]=0 $$ {#eq-1b.13}------------------------------------------------------------------------The method of moments approach to estimate the parameters imposes these two population moments restrictions on the sample data - In particular: the *population moments* are **replaced** by their *sample counterparts* - The **justification** is as follows: By the **Law of Large Numbers**, **LLN**, the sample moments converge to their population/theoretical counterparts under rather weak assumptions (stationarity, weak dependence). E.g., with increasing sample size $n$ the sample mean of a random variable converge to the expectation of this random variable (compare @thm-LLN)- So we can *estimate the population moments by the corresponding empirical moments*. In particular, we estimate the expectation, E(y), with the arithmetic sample mean $\bar y$, knowing that by the LLN this sample estimator converges to E(y) with increasing sample size - Hence, the population moment conditions, @eq-1b.12 and @eq-1b.13, can be replaced (estimated) by their corresponding sample means:$$\frac {1}{n} \sum_{i=1}^n (y_i-\hat \beta_0-\hat \beta_1 x_i)=0 $$$$\frac {1}{n} \sum_{i=1}^n x_i \, (y_i-\hat \beta_0-\hat \beta_1 x_i)=0$$------------------------------------------------------------------------However, the above conditions (which the parameters $\beta_0$ and $\beta_1$ have to meet) are *exactly the same* as the first order conditions from minimizing the sum of squared residuals, the normal equations, @eq-1b.8 and @eq-1b.9, and therefore yield the same solutions. - Hence, OLS and MoM estimation yield the very same estimated parameters $\hat \beta_0$ and $\hat \beta_1$ in this case. (For an additional analysis of MoM estimation, see @sec-matrix)- Furthermore, the OLS estimator is also equal to the **maximum likelihood estimator**, **ML**, assuming normally distributed error terms - Maximum likelihood estimation is treated in more detail in @sec-MLE. Intuitively, ML means that – for a given sample – the estimated parameters are chosen such that the probability of obtaining the respective sample is maximized- Under standard assumptions, OLS, MoM and ML estimators are equivalent (but generally, they can be different!)------------------------------------------------------------------------### An example in R- Install R from https://www.r-project.org- Install RStudio from https://rstudio.com/products/rstudio/download/#download- Start RStudio and install the packages AER and Wooldridge (which we will need very often). 
For that purpose go to the lower right window, choose the tab *Packages*, then the tab *Install* and enter AER and then click *Install*. If you are asked during the installation whether you want to compile code, type: no (in the lower left window). Repeat the same for the package Wooldridge- To input code use the upper left window. To execute code, mark the code in the upper left window and click on the tap *Run* at the top of the upper left window- You will see the results in the lower left window- To run the examples from these slides, simply copy the code from the slides (shaded in grey) into the upper left window, mark it and run it------------------------------------------------------------------------##### We want to investigate to what extent the success in an election is determined by the expenditures during the campaign.```{r}#| comment: " "# We use a data set contained in the "Wooldridge" package # We already installed this package, however, if we want to use it in R, # we additionally have to load it with the library() commandlibrary(wooldridge)# Loading the data set "vote1" from the Wooldridge packages with the "data" commanddata(vote1)# printing out the first 6 observation of the data set "vote1" with the command "head()"head(vote1)```------------------------------------------------------------------------##### Plotting the percentage of votes for candidate A versus the share of campaign expenditures from A.```{r, fig.align='center', fig.height=5}plot(voteA ~ shareA, data=vote1)```------------------------------------------------------------------------##### Running a regression of voteA on shareA with the command `lm()` (for linear model)```{r, fig.align='center', fig.height=5}#| comment: " "out <-lm(voteA ~ shareA, data=vote1)# We stored the results in a list with the freely chosen name "out"# With coef(out) we print out the estimated coefficients # Try to interpret the estimated coefficients coef(out)# With fitted(out) we store the fitted valuesyhat <-fitted(out)# With residuals(out) we store the residualsuhat <-residuals(out)```##### Checking the orthogonal property of OLS -- the correlation between explanatory variable $x$ and the residuals $\hat u$.```{r}#| comment: " "round( cor(uhat, vote1$shareA), digits =14) ```------------------------------------------------------------------------```{r, fig.align='center', fig.height=4.8}# Previous plot plus estimated regression line.plot(voteA ~ shareA, data=vote1)abline(out)```------------------------------------------------------------------------##### Plotting residuals. These should show no systematic pattern.```{r, fig.align='center', fig.height=4.8}plot(uhat)abline(0,0)```------------------------------------------------------------------------##### Plotting predicted values versus actual values of voteA. Are predictions biased?```{r, fig.align='center', fig.height=4.8}plot(yhat ~ voteA, data=vote1)# 45° lineabline(0,1)```------------------------------------------------------------------------##### Plotting squared residuals versus fitted values. Useful for detecting a varying variance (heteroscedasticity)```{r, fig.align='center', fig.height=4.8}plot(uhat^2~ yhat, data=vote1)```------------------------------------------------------------------------#### Discussion of exampleThis simple model for the success in an election seems very plausible, however *it suffers from a very common problem*- In this particular example, the *conditional mean independence assumption* is almost certainly violated. Why? 
------------------------------------------------------------------------

#### Discussion of example

This simple model for the success in an election seems very plausible; however, *it suffers from a very common problem*

- In this particular example, the *conditional mean independence assumption* is almost certainly violated. Why?

  - Because the campaign expenditures strongly depend on donations from supporters. The stronger a candidate is in a particular district, the more donations he will get and the higher the potential campaign expenditures will be

  - Hence, we have a reverse causality problem here, $x \leftrightarrows y$, or a third variable problem, $z_j \rightarrow x \text{ and } z_j \rightarrow y$, both of which lead to $E(u|x) \neq 0$ in general

  - This will probably lead to a strong *overestimation* of the effects of campaign expenditures on votes in this particular case

- Note that although $x$ is very likely correlated with unobserved factors in $u$, the example above showed that the **correlation between** $x$ and the sample residuals $\hat u$ **is zero** -- the orthogonality property of OLS. Hence, **this fact says nothing** about whether the conditional mean independence assumption is satisfied or not

- A possible remedy: a *multiple regression model* (with variables $z$ included as additional explanatory variables) or trying to *identify* the $x \rightarrow y$ relationship with external information (like instrumental variables; we will deal with this approach in @sec-IV)

------------------------------------------------------------------------

### Measures of Goodness-of-Fit {#sec-R2}

How well does the explanatory variable explain the dependent variable?

**Measures of Variation**

$$SST = \sum\nolimits_{i=1}^n (y_i - \bar y)^2, \quad SSE = \sum\nolimits_{i=1}^n (\hat y_i - \bar y)^2, \quad SSR =\sum\nolimits_{i=1}^n \hat u_i^2 $$

- *SST* is the *total sum of squares*; it represents the total variation in the dependent variable
- *SSE* is the *explained sum of squares*; it represents the variation explained by the regression
- *SSR* is the *residual sum of squares*; it represents the variation not explained by the regression

**Decomposition of total variation** (because of $y_i = \hat y_i + \hat u_i$, $\sum_i x_i \hat u_i=0$ and $\sum_i \hat u_i=0$)

$$SST = SSE + SSR $$ {#eq-SST}

**Goodness-of-Fit measure**

$$R^2 \ \equiv \ \dfrac {SSE}{SST}\ = \ 1 - \dfrac {SSR}{SST}$$ {#eq-R2}

The **R-squared** *measures the fraction of the total variation in* $y$ that is explained by the regression

------------------------------------------------------------------------

#### Example

```{r}
#| comment: " "
# Running once more the regression of voteA on shareA with the command lm()
out <- lm(voteA ~ shareA, data = vote1)

# Printing a summary of the regression
summary(out)

# Caution: A high R-squared does not mean that the regression has a causal interpretation!
```
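To make @eq-SST and @eq-R2 concrete, the three sums of squares can be computed by hand from `yhat` and `uhat`; a minimal sketch that reproduces the R-squared reported by `summary(out)`:

```{r}
#| comment: " "
SST <- sum( (vote1$voteA - mean(vote1$voteA))^2 )
SSE <- sum( (yhat - mean(vote1$voteA))^2 )
SSR <- sum( uhat^2 )

# Decomposition of total variation: SST = SSE + SSR
c(SST, SSE + SSR)

# R-squared computed both ways, and as reported by summary()
c(SSE/SST, 1 - SSR/SST, summary(out)$r.squared)
```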
------------------------------------------------------------------------

### Statistical Properties of OLS

- The OLS parameter estimates (estimated coefficients) are functions of random variables and thus **random variables themselves**
- We are interested in the moments and the distribution of the estimated coefficients, especially in the expectations and variances
- Three questions are of particular interest:
  - Are the OLS estimates unbiased, i.e., $E(\hat \beta_i) = \beta_i \, ?$
  - How precise are our parameter estimates, i.e., how large is their variance $\operatorname {Var}(\hat \beta_i) \; ?$
  - How are the estimated OLS coefficients distributed?

------------------------------------------------------------------------

#### Unbiasedness of OLS

::: {#thm-unbiased0}
## Unbiasedness of OLS

Given a *random sample* and *conditional mean independence* of $u_i$ from $x$ we state:

$$E(\hat \beta_0)=\beta_0, \ \ E(\hat \beta_1)=\beta_1$$
:::

::: {.callout-tip collapse="true" icon="false"}
## Proof of @thm-unbiased0

From @eq-beta1_hat we have

$$\hat \beta_1 = \dfrac { \frac{1}{n} \sum_{i=1}^n (x_i - \bar x) y_i}{\frac {1}{n} \sum_{i=1}^n (x_i - \bar x)^2 } $$ {#eq-1b.11b}

We substitute for $y_i = \beta_0 + \beta_1 x_i + u_i$

$$\hat \beta_1 = \dfrac { \frac{1}{n} \sum_{i=1}^n (x_i - \bar x) (\beta_0 + \beta_1 x_i + u_i)}{\frac {1}{n} \sum_{i=1}^n (x_i - \bar x)^2 } \ = $$

$$\beta_0 \underbrace{ \left[ \dfrac { \frac{1}{n} {\sum_{i=1}^n (x_i - \bar x)} }{\frac {1}{n} \sum_{i=1}^n (x_i - \bar x)^2 }\right]}_0 + \beta_1 \underbrace{ \left[ \dfrac { \frac{1}{n} \sum_{i=1}^n (x_i - \bar x) x_i }{\frac {1}{n} \sum_{i=1}^n (x_i - \bar x)^2 }\right]}_1 + \dfrac { \frac{1}{n} \sum_{i=1}^n (x_i - \bar x) u_i }{\frac {1}{n} \sum_{i=1}^n (x_i - \bar x)^2 } \ \ \Rightarrow$$

$$\hat \beta_1 \, = \, \beta_1 + \dfrac { \frac{1}{n} \sum_{i=1}^n (x_i - \bar x) u_i } { \underbrace{ \frac {1}{n} \sum_{i=1}^n (x_i - \bar x)^2 }_{s_x^2} } $$ {#eq-1b.17}

Taking the conditional expectation, considering the *conditional mean independence assumption*,

$$E(\hat \beta_1 | x_1, \ldots, x_n ) \ = \ \beta_1 + \dfrac {1}{s_x^2} \frac{1}{n} \sum_{i=1}^n (x_i - \bar x) \underbrace {E [ u_i | x_1, \ldots, x_n ]}_0 \ = \ \beta_1 \ \ \text{ and }$$

$$E(\hat \beta_1) = E_x [E(\hat \beta_1 | x_1, \ldots, x_n )] = E_x(\beta_1)=\beta_1 $$

by the *law of iterated expectations*
:::

**Interpretation of unbiasedness**

- The estimated coefficients may be smaller or larger than the true values, depending on the sample, which is the result of a random draw
- However, **on average**, they will be equal to the true values (on average means with regard to repeated samples)
- In a **given sample**, estimates **may differ considerably from true values**
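This interpretation is easy to illustrate numerically. The following minimal sketch (with an arbitrarily chosen seed; the graphical Monte Carlo study `sims()` further below makes the same point) draws many samples from a model with known $\beta_1 = 0.5$ and averages the estimates:

```{r}
#| comment: " "
set.seed(42)                # arbitrary seed for reproducibility
B0 <- 0; B1 <- 0.5; n <- 120

b1_hat <- replicate(2000, {
  x <- rnorm(n)             # explanatory variable
  u <- rnorm(n)             # error term with E(u|x) = 0 by construction
  y <- B0 + B1*x + u
  coef(lm(y ~ x))[2]        # OLS estimate of beta_1 in this sample
})

# Individual estimates vary, but their average is very close to the true 0.5
c(mean(b1_hat), sd(b1_hat))
```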
------------------------------------------------------------------------

#### Variances of the OLS estimates

- Depending on the sample, the estimates will be nearer to or farther away from the true values
- How far can we expect our estimates to be from the *true* population values on average (= sampling variability or sampling error)?
- Sampling variability is measured by the estimators' *variances*
- We need *an additional assumption* to easily calculate these variances:

**Homoscedasticity** of $u_i$

$$\operatorname{Var}(u_i| x_1, \ldots, x_n) = \sigma^2 $$ {#eq-1b.19}

- The values of the explanatory variable must not contain any information about the *variability of the unobserved factors*
- Together with the conditional mean independence assumption this furthermore implies that the *conditional* variance of $u$ is also equal to the *unconditional* variance of $u$

$$\operatorname{Var}(u) = E_x \big[ \underbrace{ E ( u^{2} \, | \, x)-[ \underbrace{E(u \, | \, x)}_{0}]^{2} }_{ \operatorname{Var}(u \, | \, x) \ = \ \sigma^2 } \big] = E_x(\sigma^2) = \sigma^2$$

- The *square root* of $\sigma^2$ is $\sigma$, the **standard deviation** of the error

------------------------------------------------------------------------

- Example: y = f(x) + u

```{r, echo=FALSE, message=FALSE}
#| fig.align: center
#| fig.cap: "Homoscedastic errors; Source: Wooldridge 2020"
#| label: fig-fig4
library(magick)
img = image_read("Homosekedasticity.png")
image_scale(img, "550")
```

------------------------------------------------------------------------

- Example: wage = f(education) + u

```{r, echo=FALSE, message=FALSE}
#| fig.align: center
#| fig.cap: "Heteroscedastic errors; Source: Wooldridge 2020"
#| label: fig-fig5
library(magick)
img = image_read("Heteroskedasticity.png")
image_scale(img, "550")
```

------------------------------------------------------------------------

::: {#thm-var0}
## Variance of OLS estimators

Under random sampling, conditional mean independence of $u_i$ from $x$ and homoscedasticity we have

$$\operatorname{Var}(\widehat{\beta}_{1} | x_1, \ldots , x_n ) = \dfrac{\sigma^{2}}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}=\dfrac{\sigma^{2}}{SST_{x}} = \frac {1}{n} \dfrac {\sigma^2}{s_x^2} $$ {#eq-var_beta1}

$$\operatorname{Var} (\widehat{\beta}_{0} | x_1, \ldots , x_n ) = \dfrac{\sigma^{2} \frac {1}{n} \sum_{i=1}^{n} x_{i}^{2}}{\sum_{i=1}^{n}\left(x_{i} - \bar{x}\right)^{2}} = \dfrac{\sigma^{2} \, \overline {x^{2}}} {SST_{x}} = \frac {1}{n} \dfrac {\sigma^2}{s_x^2} \, \overline {x^{2}}$$ {#eq-var_beta0}
:::
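Before turning to the proof, @eq-var_beta1 can be checked against the small simulation from above: there $\sigma = \sigma_x = 1$ and $n = 120$, so the formula predicts a standard deviation of roughly $\sqrt{\sigma^2 / (n \, \sigma_x^2)} = \sqrt{1/120} \approx 0.091$ for $\hat \beta_1$ (approximately, since $s_x^2$ varies from sample to sample). A minimal check, reusing the vector `b1_hat` from the sketch above:

```{r}
#| comment: " "
# Standard deviation of beta1_hat predicted by @eq-var_beta1
sqrt(1/120)

# Monte Carlo standard deviation of the 2000 simulated estimates
sd(b1_hat)
```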
::: {.callout-tip collapse="true" icon="false"}
## Proof of @thm-var0

From the proof of @thm-unbiased0, we use @eq-1b.17

$$\hat \beta_1 \ = \ \beta_1 + \dfrac { \frac{1}{n} \sum_{i=1}^n (x_i - \bar x) u_i }{ {s_x^2} } $$

Hence, according to @eq-1b.19 and random sampling (which makes the $u_i$ independent, so all covariance terms vanish) we have

$$\operatorname{Var}(\hat \beta_1|x_1, \ldots, x_n) \ = \ \dfrac { \frac{1}{n^2} \sum_{i=1}^n (x_i - \bar x)^2 \operatorname {Var} ( u_i |x_1, \ldots, x_n )}{ { (s_x^2)^2 } } \ = \ \dfrac {\sigma^2}{SST_x} \ = \ \frac {1}{n} \dfrac {\sigma^2}{s_x^2} $$

For the *unconditional* variance (which is rarely used) we have

$$\operatorname{Var}(\hat \beta_1) \ \equiv \ E \left[ \left( \hat \beta_1-E(\hat \beta_1) \right)^2 \right] \ = \ E_x \! \left[ \underbrace { E \left( (\hat \beta_1 - \beta_1)^2 \, | \, x_1, \ldots, x_n \right) }_{\operatorname{Var}(\hat \beta_1|x_1, \ldots, x_n)} \right] \ = \ E_x \! \left[ \dfrac {\sigma^2}{n \, s_x^2} \right] \ = \ \dfrac {1}{n} E_x \! \left[ \dfrac {\sigma^2}{s_x^2} \right]$$
:::

The *sampling variability* of the estimated regression coefficients will be the **lower**,

- the *smaller* the variability of the unobserved factors, $\sigma^2$
- the *higher* the variation in the explanatory variable, $s_x^2$
  - In particular, the ratio $\sigma / s_x$ is crucial
- the *larger* the sample size $n$

------------------------------------------------------------------------

#### Estimating the variance of the error term

- According to our homoscedasticity assumption, the *variance of the error term* $u$ is independent of the explanatory variables

$$\operatorname {Var}(u \, | \, x) = \sigma^2 = \operatorname {Var}(u)$$

- However, $\sigma^2$ is usually *unknown*, so we need an **estimator** for this parameter
- A natural procedure is to use the **variance of the sample residuals** (note that $\bar {\hat u} = 0$, which is an OLS property, see @sec-properties, #2)

::: {.callout-important appearance="simple" icon="false"}
$$\hat \sigma^2 = \dfrac {1}{n-2} \sum_{i=1}^n (\hat u_i - \bar {\hat u})^2 \ = \ \dfrac {1}{n-2} \sum_{i=1}^n \hat u_i^2 $$ {#eq-sigma_hat}
:::

$$\text{and} \quad S.E. \ \equiv \ \hat \sigma \ = \ \sqrt{\hat \sigma^2} $$ {#eq-1b.23}

- This estimator turns out to be **unbiased** under our assumptions (see @thm-unbiased1)
- Note that we divide by $(n-2)$ and not by $n$ to calculate the average above. The reason is that for calculating $\hat u_i$, we first need to estimate the two parameters $\beta_0$ and $\beta_1$. This means that, knowing these two estimated parameters, only $(n-2)$ observations are informative -- if we take the two estimated parameters together with $(n-2)$ observations, we could infer the remaining two observations. Therefore, these last two observations contain no additional information
  - The number $(n-2)$, which is the *number of observations minus the number of estimated model parameters*, is referred to as the **degrees of freedom**

------------------------------------------------------------------------

#### Standard errors for regression coefficients

Having an estimate for $\sigma^2$ and the standard error *S.E.*, we are able to estimate the standard errors of the parameter estimates

**Calculation of standard errors for regression coefficients**

Using formulas @eq-var_beta1 and @eq-1b.23 we arrive at

$$se(\hat \beta_1) \ = \ \sqrt{\widehat {\operatorname {Var}}(\hat \beta_1 | x_1, \ldots , x_n)} \ = \ \sqrt{\dfrac {\hat \sigma^2}{SST_x} } $$ {#eq-se}

- The estimated standard deviations of the regression coefficients are called **standard errors**. They measure how precisely the regression coefficients are estimated
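These formulas are straightforward to verify with the voting regression from above; a minimal sketch that computes $\hat \sigma^2$, the *S.E.* of the regression and $se(\hat \beta_1)$ by hand and compares them with the output of `summary(out)` (there, the *S.E.* appears as the residual standard error and $se(\hat \beta_1)$ in the coefficient table):

```{r}
#| comment: " "
n <- nobs(out)

# sigma^2_hat according to @eq-sigma_hat, with n - 2 degrees of freedom
sigma2_hat <- sum(uhat^2) / (n - 2)

# S.E. of the regression, @eq-1b.23
sqrt(sigma2_hat)

# se(beta1_hat) according to @eq-se
SSTx <- sum( (vote1$shareA - mean(vote1$shareA))^2 )
sqrt(sigma2_hat / SSTx)
```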
------------------------------------------------------------------------

##### The following figures should illustrate the theoretical concepts discussed above

```{r}
#| layout-ncol: 2
#| layout-nrow: 2
#| label: fig-PRL_regline_unbiased
#| fig-cap: "Population regression function (PRF), estimated regression functions, unbiasedness and variance of estimates"
#| fig-width: 5.8
#| fig-height: 4.5
#| fig-subcap:
#|   - "Population regression function (red) and estimated regression function of a particular sample (blue), 120 observations, compare @fig-fig2 and @fig-fig3"
#|   - "Unbiasedness: Histogram of 5000 estimates of $\\beta_1$ based on random draws of $u$ and $x$ with 120 observations each. True value of $\\beta_1$ is 0.5"
#|   - "Variance of $\\hat \\beta_1$: 100 random samples with 120 observations each. PRF in red, estimated regression functions in grey"
#|   - "Variance of $\\hat \\beta_1$ with smaller ${\\sigma} / {\\sigma_x}$: 100 random samples with 120 observations each. PRF in red, estimated regression functions in grey. Variance of $\\hat \\beta_1$ is much smaller"
#| code-fold: true

## Monte Carlo simulation for regressions with one explanatory variable ##

#################### definition of function ####################
sims <- function(n = 120, rep = 5000, sigx = 1, sig = 1) {
  set.seed(13468)   # seed for random number generator

  # true parameters
  B0 <- 0
  B1 <- 0.5

  # initializing lists for storing results
  OLS   <- vector(mode = "list", length = rep)
  SOLS  <- vector(mode = "list", length = rep)
  OLS1  <- vector(mode = "list", length = rep)
  SOLS1 <- vector(mode = "list", length = rep)

  #################### rep loop ####################
  for (i in (1:rep)) {
    x  <- rnorm(n, mean = 0, sd = sigx)
    u  <- rnorm(n, mean = 0, sd = sig)
    u1 <- u/2
    maxx <- max(x)
    minx <- min(x)
    y  <- B0 + B1*x + u
    y1 <- B0 + B1*x + u1
    maxy <- max(y)
    miny <- min(y)
    OLS[[i]]  <- lm(y ~ x, model = FALSE)
    OLS1[[i]] <- lm(y1 ~ x, model = FALSE)
  }
  #################### end rep loop ####################

  #################### drawing plots ####################
  # scatterplot with true and estimated reg-line for last regression
  plot(y ~ x, col = "blue")
  abline(OLS[[i]], col = "blue")
  abline(c(B0, B1), col = "red")

  # rep > 100: histogram of estimated parameter b1
  if (rep > 100) {
    b1_distribution <- sapply(OLS, function(x) coef(x)[2])
    hist(b1_distribution, breaks = 30, main = "")
    abline(v = B1, col = "red")
  }

  # true and up to 100 estimated reg-lines
  plot(NULL, xlim = c(minx*1.1, maxx*1.1), ylim = c(miny*1.1, maxy*1.1),
       ylab = "y", xlab = "x")
  for (i in 1:min(100, rep)) abline(OLS[[i]], col = "lightgrey")
  points(y ~ x, col = "blue")
  abline(c(B0, B1), col = "red")

  # true and up to 100 estimated reg-lines, smaller sig
  plot(NULL, xlim = c(minx*1.1, maxx*1.1), ylim = c(miny*1.1, maxy*1.1),
       ylab = "y1", xlab = "x")
  for (i in 1:min(100, rep)) abline(OLS1[[i]], col = "lightgrey")
  points(y1 ~ x, col = "blue")
  abline(c(B0, B1), col = "red")
}
#################### end of function ####################

# Calling function sims() with default values for parameters
sims()
```

------------------------------------------------------------------------

### Example once more

We repeat the regression output from our voting example. Look for the new concepts we just discussed in the regression output shown below.

```{r}
library(modelsummary)

# Running once more the regression of voteA on shareA with the command lm()
out <- lm(voteA ~ shareA, data = vote1)

modelsummary(list("Vote for candidate A" = out),
             shape = term ~ statistic,
             statistic = c('std.error', 'statistic', 'p.value', 'conf.int'),
             stars = TRUE,
             gof_omit = "A|L|B|F",
             align = "ldddddd",
             output = "gt")
```